1. Introdução


Qualquer viajante frequente já deve ter se perguntado que rotas as companhias aéreas oferecem a partir do lugar onde ele mora, de modo a poder viajar mais e sobretudo para lugares ainda desconhecidos. Para responder a esta pergunta, pode-se utilizar serviços web de pesquisa de voos, tais como o Skyscanner e o Google Flights, descobrindo a resposta trecho a trecho (ao mesmo tempo em que se verifica o preço da passagem), mas pode-se responder a esta mesma pergunta de forma mais sistemática, através de ferramentas de Data Science. A vantagem de sistematizar a informação acerca das rotas supracitadas, através de conjuntos de dados bem consolidados, é ir além das limitações encontradas nas ferramentas web tais quais as descritas anteriormente. Por exemplo:

1) Como garantir que nas ferramentas web existentes estão sendo exibidas todas as opções de rotas?

2) Como buscar voos com Stopover com quantidade de dias indeterminada?

3) Como buscar voos apenas de companhias aéreas que me dão pontos de fidelidade?

2. Fonte dos dados

É necessário que se tenha um banco de dados de rotas bem organizado. Felizmente, existem alguns conjuntos de dados deste tipo, por exemplo o disponibilizado pelo OpenFlights, que tem dados disponíveis acerca de:

  • Aeroportos;

  • Companhias Aéreas; e

  • Rotas de Voo.

    Pode-se fazer o download desses dados no endereço https://www.kaggle.com/open-flights/datasets , no qual estão distribuídos os 3 conjuntos de dados supracitados em formato CSV.

3. Possíveis Perguntas

De posso dos conjuntos de dados supracitados, pode-se elencar uma série de possíveis perguntas, tais quais as expostas a seguir:

1) Que países possuem mais companhias aéreas?

2) Que países possuem mais aeroportos?

3) Quais são os aeroportos existentes em uma localidade determinada?

4) Que cidades possuem mais aeroportos?

5) Que aeroportos estão localizados em altitudes extremas?

6) Em que países chegam mais rotas de voos?

7) De quais países saem mais companhias aéreas?

8) Que companhias aéreas possuem mais rotas de voo?

9) Que companhias aéreas vão a mais países?

10) Que rotas possuem a maior quantidade de voos?

11) Que rotas possuem a maior quantidade de companhias aéreas?

4. Análise dos Dados e Respostas às Perguntas

In [3]:
import pandas as pd 
import geopandas as gpd
import holoviews as hv 
import geoviews as gv
gv.extension('bokeh')
import numpy as np
from pyproj import Geod
from shapely.geometry import LineString, MultiLineString
import folium
%matplotlib inline
In [4]:
# Cias Aéreas

airlines = pd.read_csv('./data/airlines.csv', na_values='\\N')
print(len(airlines))
airlines.head()
6162
Out[4]:
Airline ID Name Alias IATA ICAO Callsign Country Active
0 -1 Unknown NaN - NaN NaN NaN Y
1 1 Private flight NaN - NaN NaN NaN Y
2 2 135 Airways NaN NaN GNL GENERAL United States N
3 3 1Time Airline NaN 1T RNX NEXTIME South Africa Y
4 4 2 Sqn No 1 Elementary Flying Training School NaN NaN WYT NaN United Kingdom N
In [5]:
# Apenas as cias aéreas ativas

airlines = airlines.loc[airlines['Active'] == 'Y', :].drop(columns='Active').dropna(subset=['Name'])
print(len(airlines))
aux = airlines.iloc[2:].head(10).drop(['Alias'], axis=1)
aux.to_html('./airlines_head.html')
aux
1255
Out[5]:
Airline ID Name IATA ICAO Callsign Country
3 3 1Time Airline 1T RNX NEXTIME South Africa
10 10 40-Mile Air Q5 MLA MILE-AIR United States
13 13 Ansett Australia AN AAA ANSETT Australia
14 14 Abacus International 1B NaN NaN Singapore
21 21 Aigle Azur ZI AAF AIGLE AZUR France
22 22 Aloha Airlines AQ AAH ALOHA United States
24 24 American Airlines AA AAL AMERICAN United States
28 28 Asiana Airlines OZ AAR ASIANA Republic of Korea
29 29 Askari Aviation 4K AAS AL-AAS Pakistan
32 32 Afriqiyah Airways 8U AAW AFRIQIYAH Libya
In [6]:
# Pergunta 1 - Que países possuem mais companhias aéreas?

%%opts Bars [height=300, width=800, color_index='country', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]
hv.Bars(airlines.groupby(['Country']).size().sort_values(ascending=False).iloc[:30].rename('airlines').to_frame(), 
        kdims=[('Country', 'country')], vdims=['airlines']).redim.range(airlines=(0, 160))
Out[6]:
In [7]:
# Brasil

aux = airlines[(airlines['Country'] == 'Brazil')].dropna(subset=['Callsign']).drop(['Alias', 'Country'], axis=1)
aux.to_html('./airlines_brazil.html')
aux
Out[7]:
Airline ID Name IATA ICAO Callsign
42 42 ABSA - Aerolinhas Brasileiras M3 TUS ABSA Cargo
1475 1476 Brazilian Air Force NaN BRS BRAZILIAN AIR FORCE
2580 2581 Gol Transportes Aéreos G3 GLO GOL TRANSPORTE
3759 3764 Oceanair O6 ONE OCEANAIR
3908 3914 Passaredo Transportes Aereos NaN PTB PASSAREDO
4857 4867 TAM Brazilian Airlines JJ TAM TAM
5177 5188 TRIP Linhas A 8R TIB TRIP
5337 5354 Varig Log LC VLO VELOG
5351 5368 VRG Linhas Aereas RG VRN VARIG
5356 5373 VASP VP VSP VASP
5382 5399 WebJet Linhas A WJ WEB WEB-BRASIL
5701 13306 BRAZIL AIR GB BZE BRAZIL AIR
5780 16234 Fly Brasil F1 FBL FBL
5813 16645 NEXT Brasil XB NXB XB
5841 16826 Whitejets NaN WTJ WHITEJET
In [8]:
# Aeroportos

airports = pd.read_csv('./data/airports.csv', na_values='\\N', header=None).rename(columns={0: 'Airport ID',
                        1: 'Name', 2: 'City', 3: 'Country', 4: 'IATA', 5: 'ICAO', 6: 'Latitude', 7: 'Longitude',
                        8: 'Altitude', 9: 'HoursFromUTC', 10: 'DST', 11: 'Timezone', 12: 'Type', 13: 'Source'})
print(len(airports))
airports.head()
10668
Out[8]:
Airport ID Name City Country IATA ICAO Latitude Longitude Altitude HoursFromUTC DST Timezone Type Source
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081690 145.391998 5282 10.0 U Pacific/Port_Moresby airport OurAirports
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.207080 145.789001 20 10.0 U Pacific/Port_Moresby airport OurAirports
2 3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826790 144.296005 5388 10.0 U Pacific/Port_Moresby airport OurAirports
3 4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10.0 U Pacific/Port_Moresby airport OurAirports
4 5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380 147.220001 146 10.0 U Pacific/Port_Moresby airport OurAirports
In [9]:
# Apenas aeroportos reais

airports = airports.loc[(airports['Type'] == 'airport') & (airports['Name'] != 'All Airports')].drop(columns=['Type', 'Source'])
print(len(airports))
airports.head().to_html('./airports_head.html')
airports.head()
7740
Out[9]:
Airport ID Name City Country IATA ICAO Latitude Longitude Altitude HoursFromUTC DST Timezone
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081690 145.391998 5282 10.0 U Pacific/Port_Moresby
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.207080 145.789001 20 10.0 U Pacific/Port_Moresby
2 3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826790 144.296005 5388 10.0 U Pacific/Port_Moresby
3 4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10.0 U Pacific/Port_Moresby
4 5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380 147.220001 146 10.0 U Pacific/Port_Moresby
In [10]:
# Pergunta 2 - Que países possuem mais aeroportos?

%%opts Bars [height=300, width=800, color_index='country', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]
hv.Bars(airports.groupby(['Country']).size().sort_values(ascending=False).iloc[:30].rename('airports').to_frame(), 
        kdims=[('Country', 'country')], vdims=['airports']).redim.range(airports=(0, 1700))
Out[10]:
In [11]:
# Pergunta 4 - Que cidades possuem mais aeroportos? 

%%opts Bars [height=300, width=800, color_index='city', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]
hv.Bars(airports.groupby(['City', 'Country']).size().sort_values(ascending=False).iloc[:30].rename('airports').to_frame(), 
        kdims=[('City', 'city')], vdims=['airports']).redim.range(airports=(0, 10))
Out[11]:
In [12]:
# Pergunta 5 - Que aeroportos estão localizados em altitudes extremas? (muito alta)

%%opts Bars [height=400, width=700, color_index='airport', show_legend=False, toolbar='above']
%%opts Bars [tools=['hover'], xrotation=30]
aux = airports.sort_values(by='Altitude', ascending=False).iloc[:10][['Name', 'Altitude']].copy()
aux.loc[:, 'Altitude'] = aux['Altitude']*0.3048
aux.loc[:, 'Name'] = aux['Name'].str.rsplit(n=1, expand=True)[0]
hv.Bars(aux, kdims=[('Name', 'airport')], vdims=[('Altitude', 'altitude (m.)')])
Out[12]:
In [13]:
# Pergunta 5 - Que aeroportos estão localizados em altitudes extremas? (muito baixa)  

%%opts Bars [height=400, width=700, color_index='airport', show_legend=False, toolbar='above']
%%opts Bars [tools=['hover'], xrotation=30]
aux = airports.sort_values(by='Altitude', ascending=True).iloc[:10][['Name', 'Altitude']].copy()
aux.loc[:, 'Altitude'] = aux['Altitude']*0.3048
aux.loc[:, 'Name'] = aux['Name'].str.rsplit(n=1, expand=True)[0]
hv.Bars(aux, kdims=[('Name', 'airport')], vdims=[('Altitude', 'altitude (m.)')])
Out[13]:
In [14]:
# Rotas 

routes = pd.read_csv('./data/flights.csv', na_values='\\N')
print(len(routes))
routes.head()
67663
Out[14]:
airline airline ID source airport source airport id destination airport destination airport id codeshare stops equipment
0 2B 410.0 AER 2965.0 KZN 2990.0 NaN 0 CR2
1 2B 410.0 ASF 2966.0 KZN 2990.0 NaN 0 CR2
2 2B 410.0 ASF 2966.0 MRV 2962.0 NaN 0 CR2
3 2B 410.0 CEK 2968.0 KZN 2990.0 NaN 0 CR2
4 2B 410.0 CEK 2968.0 OVB 4078.0 NaN 0 CR2
In [15]:
# Apenas rotas diretas com cia aereas e aeroportos conhecidos

cols = ['airline ID', 'source airport id', 'destination airport id']
routes = routes[routes['stops'] == 0].dropna(how='any', subset=cols).drop(columns='stops')
for c in cols:
    routes.loc[:, c] = routes[c].astype('int')
print(len(routes))
routes.head().to_html('./routes_head.html')
routes.head()
66754
Out[15]:
airline airline ID source airport source airport id destination airport destination airport id codeshare equipment
0 2B 410 AER 2965 KZN 2990 NaN CR2
1 2B 410 ASF 2966 KZN 2990 NaN CR2
2 2B 410 ASF 2966 MRV 2962 NaN CR2
3 2B 410 CEK 2968 KZN 2990 NaN CR2
4 2B 410 CEK 2968 OVB 4078 NaN CR2
In [16]:
# Juntando tudo

cols = ['Name', 'City', 'Country', 'Latitude', 'Longitude']
df = (pd.merge(routes, airports, left_on='source airport id', right_on='Airport ID', how='inner')
    .rename(columns={c:'source ' + c.lower() for c in cols})
    .drop(airports.columns, axis=1, errors='ignore')
    .merge(airports, left_on='destination airport id', right_on='Airport ID', how='inner')
    .rename(columns={c:'destination ' + c.lower() for c in cols})
    .drop(airports.columns, axis=1, errors='ignore')
    .merge(airlines, left_on='airline ID', right_on='Airline ID', how='inner')
    .rename(columns={'Name': 'airline name'})
    .drop(airlines.columns, axis=1, errors='ignore')
    .sort_index(axis=1)
)
print(len(df))
df.head()
65534
Out[16]:
airline airline ID airline name codeshare destination airport destination airport id destination city destination country destination latitude destination longitude destination name equipment source airport source airport id source city source country source latitude source longitude source name
0 2B 410 Aerocondor NaN KZN 2990 Kazan Russia 55.606201 49.278702 Kazan International Airport CR2 AER 2965 Sochi Russia 43.449902 39.956600 Sochi International Airport
1 2B 410 Aerocondor NaN KZN 2990 Kazan Russia 55.606201 49.278702 Kazan International Airport CR2 ASF 2966 Astrakhan Russia 46.283298 48.006302 Astrakhan Airport
2 2B 410 Aerocondor NaN KZN 2990 Kazan Russia 55.606201 49.278702 Kazan International Airport CR2 CEK 2968 Chelyabinsk Russia 55.305801 61.503300 Chelyabinsk Balandino Airport
3 2B 410 Aerocondor NaN KZN 2990 Kazan Russia 55.606201 49.278702 Kazan International Airport CR2 DME 4029 Moscow Russia 55.408798 37.906300 Domodedovo International Airport
4 2B 410 Aerocondor NaN KZN 2990 Kazan Russia 55.606201 49.278702 Kazan International Airport CR2 EGO 6156 Belgorod Russia 50.643799 36.590099 Belgorod International Airport
In [17]:
aux = df[['airline name', 'destination name', 'destination city', 'destination country', 'destination latitude', 
   'destination longitude', 'source name', 'source city', 'source country', 'source latitude', 
   'source longitude']].head()
aux.to_html('./routes_joined_head.html')
aux
Out[17]:
airline name destination name destination city destination country destination latitude destination longitude source name source city source country source latitude source longitude
0 Aerocondor Kazan International Airport Kazan Russia 55.606201 49.278702 Sochi International Airport Sochi Russia 43.449902 39.956600
1 Aerocondor Kazan International Airport Kazan Russia 55.606201 49.278702 Astrakhan Airport Astrakhan Russia 46.283298 48.006302
2 Aerocondor Kazan International Airport Kazan Russia 55.606201 49.278702 Chelyabinsk Balandino Airport Chelyabinsk Russia 55.305801 61.503300
3 Aerocondor Kazan International Airport Kazan Russia 55.606201 49.278702 Domodedovo International Airport Moscow Russia 55.408798 37.906300
4 Aerocondor Kazan International Airport Kazan Russia 55.606201 49.278702 Belgorod International Airport Belgorod Russia 50.643799 36.590099
In [18]:
# Pergunta 6 - Em que países chegam mais rotas de voos? 

%%opts Bars [height=300, width=800, color_index='country', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

busiest_countries_in = df['destination country'].value_counts().iloc[:30]
hv.Bars(busiest_countries_in, kdims=('index', 'country'), vdims=('destination country', 'routes'))
Out[18]:
In [19]:
# Pergunta 7 - De quais países saem mais companhias aéreas?

%%opts Bars [height=300, width=800, color_index='country', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

open_countries = df.groupby('source country')['airline'].nunique().sort_values(ascending=False).iloc[:30]
hv.Bars(open_countries, kdims=('source country', 'country'), vdims=('airline', 'airlines_from'))
Out[19]:
In [20]:
# Pergunta 8 - Que companhias aéreas possuem mais rotas de voo?

%%opts Bars [height=300, width=800, color_index='airline', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

busiest_airlines = df['airline name'].value_counts().iloc[:30]
hv.Bars(busiest_airlines, kdims=('index', 'airline'), vdims=('airline name', 'routes'))
Out[20]:
In [21]:
# Pergunta 9 - Que companhias aéreas vão a mais países?

%%opts Bars [height=300, width=800, color_index='airline', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

international_airlines = df.groupby('airline name')['destination country'].nunique().sort_values(ascending=False).iloc[:30]
hv.Bars(international_airlines, kdims=('airline name', 'airline'), vdims=('destination country', 'countries_to'))
Out[21]:
In [22]:
# Pergunta 10 - Que rotas possuem a maior quantidade de voos? (chegabdo)

%%opts Bars [height=300, width=800, color_index='city', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

popular_cities = (df.groupby(['destination country', 'destination city']).size().rename('routes_to')
                    .sort_values(ascending=False).iloc[:30].reset_index())
hv.Bars(popular_cities, kdims=('destination city', 'city'), vdims=['routes_to'])
Out[22]:
In [23]:
# Pergunta 10 - Que rotas possuem a maior quantidade de voos? (saindo)

%%opts Bars [height=300, width=800, color_index='city', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

reachable_countries = (df.groupby(['source country', 'source city'])['destination country'].nunique()
                           .sort_values(ascending=False).iloc[:30].reset_index())
hv.Bars(reachable_countries, kdims=('source city', 'city'), vdims=('destination country', 'countries_to'))
Out[23]:
In [26]:
# Pergunta 11 - Que rotas possuem a maior quantidade de companhias aéreas?

%%opts Bars [height=300, width=800, color_index='route', show_legend=False]
%%opts Bars [tools=['hover'], xrotation=45]

top_routes = (df.groupby(['source country', 'source city', 'destination city', 'destination country'])['airline']
                  .nunique().sort_values(ascending=False).iloc[:30].reset_index()).copy()
top_routes.loc[:, 'route'] = top_routes['source city'] + '-' + top_routes['destination city']
hv.Bars(top_routes, kdims=['route'], vdims=[('airline', 'num_airlines')])
Out[26]:
In [27]:
def routes_from(city, geodesic=False):
    
   # Rotas por cidade
   
    depart_routes = df[(df['source city'] == city)].copy()
    depart_routes.loc[:, 'destination str'] = (depart_routes['destination name'].str.replace('Airport', '') + '(' 
                + depart_routes['destination airport'] + ')')
    depart_routes.loc[:, 'source str'] = (depart_routes['source name'].str.replace('Airport', '') + '(' 
                + depart_routes['source airport'] + ')')
    depart_routes.sort_index(axis=1, inplace=True)
    
    # rotas por cia aérea
    
    aux = (depart_routes.groupby(['source str', 'destination str'])
       .agg({'airline name': 'unique', 
             'source latitude': 'first', 'source longitude': 'first', 'source city': 'first',
             'destination latitude': 'first', 'destination longitude': 'first', 'destination city': 'first'}))
    aux.rename(columns={'airline name': 'airlines'}, inplace=True)
    aux.loc[:, 'num_airlines'] = aux['airlines'].apply(len)
    aux.loc[:, 'airlines'] = aux['airlines'].str.join(', ')    # de lista para string
    aux.reset_index(inplace=True)
    
    dest_airports = aux.filter(regex='destination *').drop_duplicates()
    src_airports = aux.filter(regex='source *').drop_duplicates()
    src_airports.columns = dest_airports.columns     
    locs = pd.concat([src_airports, dest_airports], axis=0)
    nodes = gv.Nodes(locs, kdims=[('destination longitude', 'lon'), ('destination latitude', 'lat'), 
                                  ('destination str', 'airport')], vdims=[('destination city', 'city')])
    if geodesic:
        edges = (aux
         .apply(lambda row: LineString(np.array([(row['source longitude'], row['source latitude'])] + 
                Geod(ellps='clrk66').npts(row['source longitude'], row['source latitude'], 
                       row['destination longitude'], row['destination latitude'], npts=100) +    # 100 points
                [(row['destination longitude'], row['destination latitude'])])), axis=1))
        edgepaths = gv.EdgePaths(gpd.GeoDataFrame(geometry=edges.apply(repair_line)), 
                                              kdims=[('destination longitude', 'lon'), 
                                                     ('destination latitude', 'lat')])
    else:
        edgepaths = None
    graph = gv.Graph((aux[['source str', 'destination str', 'num_airlines', 'airlines']], nodes, edgepaths), 
         kdims=[('source str', 'from'), ('destination str', 'to')], vdims=['airlines'])
    return graph
In [28]:
def repair_line(l):
    coords = list(l._get_coords())
    res = [] ; prev_lon = 0.0 ; prev_lat = 0.0 ; ini = 0
    for pos, (lon, lat) in enumerate(coords):
        if (prev_lon < -150.0 and lon > 150.0) or (prev_lon > 150.0 and lon < -150.0): 
            res.append(coords[ini:pos]) ; ini = pos
        prev_lon, prev_lat = lon, lat
    res.append(coords[ini:pos])
    return MultiLineString(res) if len(res) > 1 else l
In [29]:
%%opts Graph [width=700, height=700, xaxis=None, yaxis=None, inspection_policy='edges'] (edge_line_alpha=0.7)
%%opts Graph [tools=['hover'], toolbar='above', global_extent=True] (edge_line_width=3, edge_line_color='black')
%%opts Graph (node_fill_color='red', node_alpha=0.8, node_line_color='black', node_size=6, node_hover_fill_color='yellow')
%%opts Graph (edge_hover_line_color='green', edge_hover_line_alpha=1.0, edge_line_dash=(5,3))
hm = gv.HoloMap({city: routes_from(city, geodesic=True) 
                 for city in reachable_countries['source city']}, kdims=['city'])
gv.tile_sources.EsriImagery * hm
Out[29]:
In [ ]: